Data need to be imported, wrangled, explored, and analyzed in order to make contributions to legitimate bodies of knowledge.
R excels at these computational parts of the research process. It is a free (to use, not necessarily to develop!) and open programming language that helps you teach a computer to perform a “data science” task. Note that R was designed to do statistics/data science from its conception and has evolved that way.
RStudio is the graphical user interface/environment that we will program R inside of. It is really helpful for staying organized and offers much more functionality than the R language provides by itself. RStudio is also highly customizable. Read the customization guide by clicking here.
So, where do we type instructions for these data-related tasks?
Scripts are simply plain text files with a .R file extension. Click File --> New File --> R Script to open one. In R scripts:
- Code is entered normally
- Hashtags # are used to comment your code, or make notes to yourself and others
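A minimal script might look like this (the numbers and variable name here are made up for illustration; comments are ignored when the code runs):

```r
# calculate the average of three test scores
scores <- c(88, 92, 97)
mean(scores)
## [1] 92.33333
```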
R Markdown files utilize markdown, a lightweight markup language, to allow you to intersperse code and formattable text. Read the guide to learn how to format your R Markdown files.
Enter code in chunks and press
Control + Enter to run a line of code, run multiple
highlighted lines of code, or press the green “Play” button on the right
side to run the entire chunk.
2 + 2
## [1] 4
cos(pi)
## [1] -1
R Project files (.Rproj) are associated with your working directory (the location on your computer that your actions are relevant to) and are great for staying organized, collecting your project and its context into a single location. The working directory is set and retrieved via a file path.
You can change the working directory location with the setwd() function, or by clicking “Session” –> “Set Working Directory” –> “Choose Directory” from the top toolbar menu.
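For example (the folder path below is hypothetical; substitute one that exists on your machine):

```r
# print the current working directory
getwd()

# set it to a (hypothetical) project folder
# setwd("~/Desktop/my_project")
```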
You will find that no matter how complex of a task you are faced with, you will pretty much always rely on the same basic building blocks to accomplish it. However, it is up to you to arrange them in specific ways to accomplish something creative and meaningful.
Everything that exists in R is an object: a piece of information/data that can be manipulated.
Almost everything that happens is a function call: a command that performs an action on some “thing”. Functions always contain trailing round parentheses ().
The “things” you pass to a function are called arguments; the named slots that receive them, defined when the function is created, are called parameters.
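As a quick illustration using base R’s round() function:

```r
# round() is the function; 3.14159 is the argument passed to its
# first parameter (x), and digits = 2 supplies the "digits" parameter by name
round(3.14159, digits = 2)
## [1] 3.14
```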
Variables (which are also objects) are how data are saved in R’s memory, which lives in a physical location on your computer. These are just placeholders and can contain virtually anything: values, mathematical expressions, text, functions, or entire datasets.
<- is the assignment operator. It assigns values on
the right to objects on the left. So, after executing
x <- 5, the value of x is
5.
You can read this in plain language as “x is defined as 5”, “5 is assigned to x”, or most simply as “x is equal to 5”.
This performs the assignment step. Note that the variable now appears on your “Environment” tab.
Type the name of the variable and run it (“call” the variable) to show it on the screen.
x <- 5
x
## [1] 5
Like everything else in R, data have a class (type) associated with them, which determines how we can manipulate them. Use the class() function to find out. In this case, the variable x is the argument.
We will talk about five basic types:
- Numeric: decimals; the default class for all numbers in R
- Character: text, always wrapped in quotations " " (single or double are fine, be consistent)
- Logical: TRUE and FALSE; stored internally as 1 and 0 and as such take on mathematical properties; useful for turning function parameters “on” or “off”, as well as for data subsetting (see below)
- Integer: positive and negative whole numbers, including zero
- Factor: categorical data
Numeric class
class(x)
## [1] "numeric"
# use a hashtag within a code chunk to comment your code
# note the use of the underscore to represent a space in the variable name
my_name <- "Evan"
my_name
## [1] "Evan"
class(my_name)
## [1] "character"
TRUE + 1
## [1] 2
FALSE - 1
## [1] -1
# is 3 greater than 4?
3 > 4
## [1] FALSE
# is 4 less than or equal to 4?
4 <= 4
## [1] TRUE
# is "Apple" equal to "apple" (hint: R is case sensitive!)
"Apple" == "apple"
## [1] FALSE
# is "Apple" not equal to "Apple"
"Apple" != "Apple"
## [1] FALSE
We can use various “as dot” functions to convert data types. To convert numeric to integer class for example, we could type:
y <- as.integer(x)
y
## [1] 5
class(y)
## [1] "integer"
Other “as dot” functions exist as well: as.character(),
as.numeric(), and as.factor() to name a
few:
school <- "Stanford University"
school
## [1] "Stanford University"
class(school)
## [1] "character"
# convert to factor
school_fac <- as.factor(school)
school_fac
## [1] Stanford University
## Levels: Stanford University
class(school_fac)
## [1] "factor"
It might seem difficult to keep track of the variables you define, but remember that they are listed in RStudio’s “Environment” tab. You can also type ls() to print them to the console.
ls()
## [1] "my_name" "school" "school_fac" "x" "y"
Remove a single variable with rm()
rm(my_name)
my_name # Error
Remember to use autocomplete when typing a function or variable name, since typing by hand leaves great potential for syntactical errors.
Alternatively, you can wipe your Environment clean by clicking the yellow broom icon on the Environment tab or by typing
rm(list = ls())
To completely restart your R session, click “Session” –> “Restart R” from the top toolbar menu.
If saving one piece of data in a variable is good, saving many is
better. Use the c() function to combine multiple pieces of
data into a vector, which is an ordered group of the
same type of data.
We can nest the c() function inside of “as dot”
functions to create vectors of different types.
# example numeric (default) vector
traffic_stops <- c(8814, 9915, 9829, 10161, 6810, 8991)
# Integer, logical, and factor vectors
city <- as.factor(c("SF", "DC", "DC", "DC", "SF", "SF"))
year <- as.integer(c(2000, 2000, 2001, 2002, 2001, 2002))
# Call these variables to print them to the screen and check their class
traffic_stops
## [1] 8814 9915 9829 10161 6810 8991
city
## [1] SF DC DC DC SF SF
## Levels: DC SF
year
## [1] 2000 2000 2001 2002 2001 2002
class(traffic_stops)
## [1] "numeric"
class(city)
## [1] "factor"
class(year)
## [1] "integer"
A data frame is an ordered group of equal-length vectors.
More simply put, a data frame is a tabular data structure organized into horizontal rows and vertical columns, i.e. a spreadsheet! Data frames are often stored as comma-separated values (.csv) files: plain text files in which commas delineate column breaks, and which display nicely in spreadsheet programs like Microsoft Excel.
We can assemble our three vectors from above into a data frame with
the data.frame() function.
police <- data.frame(city, traffic_stops, year)
police
class(police)
## [1] "data.frame"
# display the compact structure of a data frame
str(police)
## 'data.frame': 6 obs. of 3 variables:
## $ city : Factor w/ 2 levels "DC","SF": 2 1 1 1 2 2
## $ traffic_stops: num 8814 9915 9829 10161 6810 ...
## $ year : int 2000 2000 2001 2002 2001 2002
# class = data.frame
# 6 observations (rows)
# 3 variables (columns, or vectors)
# column names are preceded by the dollar sign
Create a data frame that contains 6 rows and 3 columns by following the instructions above.
Advanced: what is the difference between a data frame and a tidyverse tibble?
## YOUR CODE HERE
A vector can be indexed (positionally referenced) by typing its name followed by its index within square brackets. For example, if we want to index just the first thing in the city vector, we could type
Note that R is a “one-indexed” programming language. This means that counting anything starts at 1.
city[1]
## [1] SF
## Levels: DC SF
# or
police$city[1]
## [1] SF
## Levels: DC SF
If we want to return just the third thing in traffic_stops, we would type
traffic_stops[3]
## [1] 9829
# or
police$traffic_stops[3]
## [1] 9829
Note that columns are preceded by the dollar sign $. You can access a single column by typing the name of your data frame, the $, and then the column name. Note that autocomplete works for much more than just function and variable names!
# show just the column containing the number of traffic stops
police$traffic_stops
## [1] 8814 9915 9829 10161 6810 8991
# ... which can then easily be plugged into another function
hist(police$traffic_stops)
This can also be extended to rows and columns using bracket notation [,].
Type the name of your data frame followed by square brackets with a comma in between them.
Here, we can enter two indices: one for the rows (before the comma)
and one for the columns (after the comma) like this:
[rows, cols]
For example, if we want two columns, we cannot use the dollar sign operator (since it only works for single columns), but we could type either the indices or the column names as a vector!
If either the row or column position is left blank, all rows/columns will be returned because no subset was specified.
To subset the police data frame with just the city name and number of stops columns, type
city_and_stops <- police[,c(1,2)]
city_and_stops
# or, for consecutive sequences
city_and_stops <- police[,1:2]
city_and_stops
# or using variable names
city_and_stops <- police[,c("city", "traffic_stops")]
city_and_stops
Keep in mind that redefining a variable will overwrite it each time, as we are doing here.
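As a toy example of overwriting (the variable z here is made up for illustration):

```r
z <- 10
z <- z * 2   # the old value of z (10) is overwritten
z
## [1] 20
```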
We can do the same thing for rows by adding a vector of the row indices to include. For example, to keep just rows 1, 2, and 3 along with columns “city” and “traffic_stops” we could type:
subset1 <- police[1:3, c("city", "traffic_stops")]
subset1
Or, to keep rows 1, 2, and 4 along with “city” and “traffic_stops” columns:
subset2 <- police[c(1,2,4), c("city", "traffic_stops")]
subset2
Subset by logical condition by using the logical operators discussed
above: ==, >, <=, etc.
For example, if you want to subset only rows with stops less than 9000 you would combine the dollar sign operator along with bracket notation.
This performs a row subsetting operation based on the condition of a column. Note that the column position is left blank after the comma to indicate that all columns should be returned.
low_stops <- police[police$traffic_stops < 9000, ]
low_stops
Or, to include multiple conditions use logical and &
(all conditions must be satisfied) and logical or | (just
one condition must be satisfied).
To subset just rows that contain SF as the city and stops less than 7000, type
sf_low_stops <- police[police$city == "SF" & police$traffic_stops < 7000, ]
sf_low_stops
Create a subset that contains data from DC or stops less than or equal to 7000, and just the columns “city” and “traffic_stops”.
Advanced: use the filter() and select()
functions from the dplyr R package to do the same
thing.
## YOUR CODE HERE
R provides excellent help documentation that often tells you everything you need to know (even if you do not know you need it yet!).
Type a question mark before a function name to view the help pages.
Read the Description section to understand what the function does.
The Usage section describes the parameters that you can pass arguments to.
The Arguments list defines how the arguments work.
Often included (but not always) below these sections are Details that offer more information, Value that describes the returned object, Notes, Authors, References, and copy/paste Examples to experiment with.
?c
?data.frame
# look at help pages for linear regression
?lm
# generalized linear models
?glm
# arithmetic mean
?mean
# histogram
?hist
# Wrap symbols in quotations to view their help files
?">"
?"&"
Now that you have some basic familiarity with R syntax and functionality, let’s work through a computational workflow example.
Because R has evolved for statistical work, it offers considerable built-in functionality, so you can accomplish many tasks without installing anything extra.
However, you will likely find yourself needing functionality beyond what is built into R’s base installation. User-contributed packages provide shortcuts for virtually every discipline and make R work smarter for you.
# Step 1. Physically download the files to your computer (Internet connection required)
install.packages("psych")
# Step 2. Import the package's functionality with the library() function and link it to your current RStudio session (do this each time you close and reopen RStudio)
library(psych)
# you can also call help on some packages with double question marks to view vignettes, or purposeful coding walkthroughs
# Click "Introduction to the psych package - PDF" to learn more
??psych
?describe
?describeBy
# summary statistics for pooled police
describe(police)
# summary statistics for police, by group
describeBy(police, group = police$city)
##
## Descriptive statistics by group
## group: DC
## vars n mean sd median trimmed mad min max range skew
## city* 1 3 1.00 0.00 1 1.00 0.00 1 1 0 NaN
## traffic_stops 2 3 9968.33 172.31 9915 9968.33 127.50 9829 10161 332 0.28
## year 3 3 2001.00 1.00 2001 2001.00 1.48 2000 2002 2 0.00
## kurtosis se
## city* NaN 0.00
## traffic_stops -2.33 99.48
## year -2.33 0.58
## ------------------------------------------------------------
## group: SF
## vars n mean sd median trimmed mad min max range skew
## city* 1 3 2 0.00 2 2 0.00 2 2 0 NaN
## traffic_stops 2 3 8205 1211.34 8814 8205 262.42 6810 8991 2181 -0.38
## year 3 3 2001 1.00 2001 2001 1.48 2000 2002 2 0.00
## kurtosis se
## city* NaN 0.00
## traffic_stops -2.33 699.37
## year -2.33 0.58
If you have multiple packages to install, consider passing their names to install.packages() as a character vector.
Alternatively, package managers such as pacman provide a friendlier interface.
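A pacman sketch might look like this (assumes the pacman package is already installed; its p_load() function installs any missing packages and then loads them all):

```r
# install.packages("pacman")  # run once if pacman is not yet installed
library(pacman)

# install (if needed) and load several packages in one call
p_load(psych, tidyr, dplyr, ggplot2)
```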
NOTE: type no in your console and press the Return key if prompted with “Do you want to install from sources the package which needs compilation? (Yes/no/cancel)”. If asked to update packages, type 1 in your console and press the Return key unless you have a specific reason not to update.
# Step 1. Install the files
install.packages(c("psych", "tidyr", "dplyr", "ggplot2", "broom", "tidycensus"))
# Step 2. Import them to your current session
library(psych)
library(tidyr)
library(dplyr)
library(ggplot2)
library(broom)
library(tidycensus)
Import the Mesa, Arizona, USA, dataset from The Stanford Open Policing Project.
mesa <- read.csv("data/raw/az_mesa_2020_04_01.csv",
stringsAsFactors = TRUE)
str(mesa)
## 'data.frame': 96621 obs. of 19 variables:
## $ raw_row_number : int 1 2 3 4 5 6 7 8 9 10 ...
## $ date : Factor w/ 1186 levels "2014-01-01","2014-01-02",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ time : Factor w/ 1440 levels "00:00:00","00:01:00",..: 305 754 748 833 183 93 968 1078 90 474 ...
## $ location : Factor w/ 10830 levels "& STAPLEY, MESA",..: 7553 5263 5072 5372 3504 2212 2184 2184 3149 4202 ...
## $ lat : num 33.4 33.4 33.4 33.4 33.4 ...
## $ lng : num -112 -112 -112 -112 -112 ...
## $ subject_age : int 36 58 19 46 27 58 24 27 27 29 ...
## $ subject_race : Factor w/ 6 levels "asian/pacific islander",..: 3 6 6 2 6 6 6 6 6 3 ...
## $ subject_sex : Factor w/ 2 levels "female","male": 1 1 2 2 2 2 2 2 2 1 ...
## $ officer_id_hash : Factor w/ 770 levels "0027657d5f","003bde5bfa",..: 637 677 226 529 272 272 754 754 128 146 ...
## $ type : Factor w/ 2 levels "pedestrian","vehicular": NA 2 2 2 2 2 NA NA 2 2 ...
## $ violation : Factor w/ 716 levels "2 OCCUPANTS UNDER 18 CLASS G LICENSE",..: 405 391 614 640 125 157 225 595 316 65 ...
## $ arrest_made : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ citation_issued : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
## $ warning_issued : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ outcome : Factor w/ 2 levels "arrest","citation": 2 2 2 2 2 2 1 1 2 2 ...
## $ raw_race_fixed : Factor w/ 5 levels "A","B","I","U",..: 5 5 5 2 5 5 5 5 5 5 ...
## $ raw_ethnicity_fixed: Factor w/ 3 levels "H","N","U": 1 2 2 2 2 2 2 2 2 1 ...
## $ raw_charge : Factor w/ 66 levels "-----","------",..: 33 9 32 7 5 5 4 6 44 32 ...
# view the first six rows by default
head(mesa)
# count number of missing values
sum(is.na(mesa))
## [1] 13324
# calculate proportion of missing data
(sum(is.na(mesa))) / (nrow(mesa) * ncol(mesa))
## [1] 0.007257875
Subset only columns: date, subject_race, subject_sex, subject_age, officer_id_hash, violation, arrest_made
clean <- mesa[, c("date", "subject_race", "subject_sex", "subject_age", "officer_id_hash", "violation", "arrest_made")]
str(clean)
## 'data.frame': 96621 obs. of 7 variables:
## $ date : Factor w/ 1186 levels "2014-01-01","2014-01-02",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ subject_race : Factor w/ 6 levels "asian/pacific islander",..: 3 6 6 2 6 6 6 6 6 3 ...
## $ subject_sex : Factor w/ 2 levels "female","male": 1 1 2 2 2 2 2 2 2 1 ...
## $ subject_age : int 36 58 19 46 27 58 24 27 27 29 ...
## $ officer_id_hash: Factor w/ 770 levels "0027657d5f","003bde5bfa",..: 637 677 226 529 272 272 754 754 128 146 ...
## $ violation : Factor w/ 716 levels "2 OCCUPANTS UNDER 18 CLASS G LICENSE",..: 405 391 614 640 125 157 225 595 316 65 ...
## $ arrest_made : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
# or
clean <- mesa[, c(2, 8, 9, 7, 10, 12, 13)]
str(clean)
## 'data.frame': 96621 obs. of 7 variables:
## $ date : Factor w/ 1186 levels "2014-01-01","2014-01-02",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ subject_race : Factor w/ 6 levels "asian/pacific islander",..: 3 6 6 2 6 6 6 6 6 3 ...
## $ subject_sex : Factor w/ 2 levels "female","male": 1 1 2 2 2 2 2 2 2 1 ...
## $ subject_age : int 36 58 19 46 27 58 24 27 27 29 ...
## $ officer_id_hash: Factor w/ 770 levels "0027657d5f","003bde5bfa",..: 637 677 226 529 272 272 754 754 128 146 ...
## $ violation : Factor w/ 716 levels "2 OCCUPANTS UNDER 18 CLASS G LICENSE",..: 405 391 614 640 125 157 225 595 316 65 ...
## $ arrest_made : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
Get some quick facts about your data with the functions below.
# Tabulate frequencies of a single variable
table(clean$subject_race)
##
## asian/pacific islander black hispanic
## 1278 6459 16626
## other unknown white
## 2305 7092 62861
table(clean$arrest_made)
##
## FALSE TRUE
## 95113 1508
# Compute crosstabs
table(clean$subject_race, clean$arrest_made)
##
## FALSE TRUE
## asian/pacific islander 1268 10
## black 6255 204
## hispanic 16181 445
## other 2183 122
## unknown 7088 4
## white 62138 723
# Produce summary statistics
# Six-number summaries (minimum, 1st quartile, median, mean, 3rd quartile, and maximum) are returned for quantitative variables (i.e., subject_age)
# Frequencies are returned for non-quantitative variables (all other columns)
summary(clean)
## date subject_race subject_sex
## 2015-05-28: 224 asian/pacific islander: 1278 female:39083
## 2017-03-30: 212 black : 6459 male :56248
## 2015-08-20: 202 hispanic :16626 NA's : 1290
## 2015-05-19: 197 other : 2305
## 2015-03-02: 188 unknown : 7092
## 2015-04-16: 186 white :62861
## (Other) :95412
## subject_age officer_id_hash
## Min. :10.00 cb50401ff4: 3563
## 1st Qu.:25.00 2ea1bea91b: 2599
## Median :34.00 aea894f584: 2594
## Mean :37.41 cf084c69a3: 2571
## 3rd Qu.:47.00 29dac4e72e: 2550
## Max. :99.00 96fcddf228: 2467
## NA's :1448 (Other) :80277
## violation arrest_made
## SPEED NOT R&P/FTC SPEED TO AVOID A COLLISION:30021 Mode :logical
## EXPIRED REGISTRATION : 6622 FALSE:95113
## NO PROOF OF INSURANCE : 4547 TRUE :1508
## M / I SUSPENSION REGISTRATION : 3624
## NO VALID DRIVERS LICENSE - IN/OUT OF STATE : 3548
## (Other) :48211
## NA's : 48
# use the describeBy() function from the psych package
# by sex
summary_sex <- describeBy(clean, group = clean$subject_sex)
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
summary_sex
##
## Descriptive statistics by group
## group: female
## vars n mean sd median trimmed mad min max range
## date* 1 39083 565.99 331.20 558 560.09 389.92 1 1186 1185
## subject_race* 2 39083 5.10 1.45 6 5.36 0.00 1 6 5
## subject_sex* 3 39083 1.00 0.00 1 1.00 0.00 1 1 0
## subject_age 4 38960 37.69 15.45 34 36.10 14.83 14 99 85
## officer_id_hash* 5 39083 405.43 227.91 434 410.13 289.11 1 770 769
## violation* 6 39063 432.18 196.31 450 445.88 243.15 1 716 715
## arrest_made 7 39083 NaN NA NA NaN NA Inf -Inf -Inf
## skew kurtosis se
## date* 0.12 -1.01 1.68
## subject_race* -1.24 -0.01 0.01
## subject_sex* NaN NaN 0.00
## subject_age 0.85 0.18 0.08
## officer_id_hash* -0.14 -1.32 1.15
## violation* -0.39 -1.32 0.99
## arrest_made NA NA NA
## ------------------------------------------------------------
## group: male
## vars n mean sd median trimmed mad min max range
## date* 1 56248 557.91 331.74 548 550.68 392.89 1 1186 1185
## subject_race* 2 56248 4.98 1.51 6 5.22 0.00 1 6 5
## subject_sex* 3 56248 2.00 0.00 2 2.00 0.00 2 2 0
## subject_age 4 56101 37.20 15.47 34 35.58 16.31 10 99 89
## officer_id_hash* 5 56248 400.74 226.67 432 404.40 286.14 1 770 769
## violation* 6 56220 408.01 203.10 392 419.11 329.14 1 711 710
## arrest_made 7 56248 NaN NA NA NaN NA Inf -Inf -Inf
## skew kurtosis se
## date* 0.14 -1.02 1.40
## subject_race* -1.03 -0.57 0.01
## subject_sex* NaN NaN 0.00
## subject_age 0.85 0.10 0.07
## officer_id_hash* -0.10 -1.31 0.96
## violation* -0.26 -1.37 0.86
## arrest_made NA NA NA
# Or, just access the female data, and subset just subject_age and columns 3 and 4 (mean and sd)
mesa_female <- summary_sex$female["subject_age" ,3:4]
mesa_female
# or
mesa_female <- summary_sex$female["subject_age", c("mean", "sd")]
mesa_female
Perhaps there are differences in sex, race, and age within our data. Let’s eyeball it to find out!
A histogram can be used to view the distribution of one variable.
hist(clean$subject_age)
hist(clean$subject_age,
breaks = 5,
main = "Include a title")
hist(clean$subject_age,
breaks = 50,
col = "gray80",
main = "Age distribution of Mesa, AZ USA traffic stops\n December 2013 - March 2017",
xlab = "Driver Age",
ylab = "Frequency",
las = 1)
# Type colors() to see the 657 stock colors available
# colors()
The data are positively skewed, and we know from our summary statistics that the mean age is ~37 years and that ages are widely dispersed.
Boxplots can be used similarly to histograms to view the distribution of one variable, but we can conveniently plot categorical groupings side by side for interpretation.
# subject_sex
boxplot(clean$subject_age ~ clean$subject_sex)
# subject_race
boxplot(clean$subject_age ~ clean$subject_race)
We can now ask a question such as: are there statistically significant differences in arrests made by race? Something like a one-way analysis of variance (ANOVA) model can help us understand.
aov_race <- aov(clean$arrest_made ~ clean$subject_race)
summary(aov_race)
## Df Sum Sq Mean Sq F value Pr(>F)
## clean$subject_race 5 9.7 1.9342 126.7 <2e-16 ***
## Residuals 96615 1474.8 0.0153
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This p-value indicates that there are statistically significant differences by race. But, how do we know between which groups? We can use a post-hoc test to find out the pairwise differences, along with adjusted p-values.
TukeyHSD(aov_race)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = clean$arrest_made ~ clean$subject_race)
##
## $`clean$subject_race`
## diff lwr upr
## black-asian/pacific islander 0.023759110 0.012980037 0.0345381842
## hispanic-asian/pacific islander 0.018940581 0.008720392 0.0291607704
## other-asian/pacific islander 0.045103690 0.032824613 0.0573827675
## unknown-asian/pacific islander -0.007260710 -0.017960030 0.0034386091
## white-asian/pacific islander 0.003676841 -0.006271445 0.0136251263
## hispanic-black -0.004818529 -0.009980691 0.0003436328
## other-black 0.021344580 0.012802251 0.0298869090
## unknown-black -0.031019821 -0.037075486 -0.0249641551
## white-black -0.020082270 -0.024682708 -0.0154818313
## other-hispanic 0.026163109 0.018337816 0.0339884027
## unknown-hispanic -0.026201292 -0.031194779 -0.0212078037
## white-hispanic -0.015263740 -0.018334224 -0.0121932571
## unknown-other -0.052364401 -0.060805869 -0.0439229322
## white-other -0.041426850 -0.048893531 -0.0339601678
## white-unknown 0.010937551 0.006527218 0.0153478842
## p adj
## black-asian/pacific islander 0.0000000
## hispanic-asian/pacific islander 0.0000019
## other-asian/pacific islander 0.0000000
## unknown-asian/pacific islander 0.3812921
## white-asian/pacific islander 0.8996599
## hispanic-black 0.0834401
## other-black 0.0000000
## unknown-black 0.0000000
## white-black 0.0000000
## other-hispanic 0.0000000
## unknown-hispanic 0.0000000
## white-hispanic 0.0000000
## unknown-other 0.0000000
## white-other 0.0000000
## white-unknown 0.0000000
Read the du Prel et al. article “Confidence Interval or P-Value” linked in the README file to learn more about hypothesis testing, point estimations, and confidence in analysis.
Results of a one-way ANOVA could be interpreted to suggest that white drivers are arrested at lower rates than most other groups. Why might this be true? Is there bias in our data? Do sample size differences influence the results? Did we violate any assumptions of the test? Would a non-parametric alternative such as kruskal.test() be more appropriate? These are the tough questions that are up to the researcher to investigate.
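As a sketch of the non-parametric alternative with made-up data (not the Mesa dataset), a Kruskal-Wallis test looks like this:

```r
# hypothetical arrest indicators (1 = arrest, 0 = none) for three groups
arrests <- c(1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0)
group   <- factor(rep(c("a", "b", "c"), each = 4))

# rank-based alternative to the one-way ANOVA
kruskal.test(arrests ~ group)
```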
BINDER USERS! Click the “More” button on the “Files” tab in Binder, and then select “Export” to download files or folders to your computer. Screenshot below:
[Screenshot: Binder “Export” menu]
There are many ways to import and export data in R.
We can save datasets and tables we create with the
write.csv() function. The first argument is the variable
name, the second argument is the file path/name.
Note that we save this in the “preprocessed” data subfolder, and could even save it in a separate directory named “Tables” or something like that.
write.csv(police, file = "data/preprocessed/police.csv")
Where does it save? Use getwd() to find out. Also check out the here R package vignette to learn more efficient ways of structuring and understanding file paths.
We specified the “data” –> “preprocessed” folder of the parent directory “sicss-howard-mathematica-R-2022”.
getwd()
## [1] "/Users/evanmuzzall/Desktop/sicss-howard-mathematica-R-2022"
We can also use the save() function to save variables in an .RData file for easy loading in the next lesson. Save only the clean dataset like this:
save(clean, file = "data/preprocessed/clean.RData")
“Reproducible” research is a term often discussed but rarely understood. It is something you want to build into your research design very early on! As a project nears completion, it is very difficult to make it reproducible in retrospect. Ask me if you want to learn more!
Replication = code + data
Computational reproducibility = code + data + environment + distribution
Reproducibility checklist by Roger Peng
Start with science (avoid vague questions and concepts)
Don’t do things by hand (not only about automation but also documentation)
Don’t point and click (same problem)
Teach a computer (automation also solves documentation to some extent)
Use some version control
Don’t save output (instead, keep the input and code)
Set your seed
Think about the entire pipeline
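“Set your seed” means fixing R’s random number generator so that “random” results are repeatable:

```r
set.seed(123)
a <- rnorm(3)

set.seed(123)  # same seed ...
b <- rnorm(3)  # ... same "random" numbers

identical(a, b)
## [1] TRUE
```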
Reproduce the above workflow using the dataset “ca_oakland_2020_04_01 2.csv”.
Ask the question: how does the distribution of arrests vary by race? Hint: use a barplot (you might have to generate frequencies first to make the barplot!).
Save the “ca_oakland_2020_04_01 2.csv” dataset in a variable named oak. Use the write.csv() function to save it in a file named “oak.csv” in the “data/preprocessed” folder.
Advanced: Use ggplot2 to make the barplot.
What other statistical applications might be useful?